The imputeORS package provides tools to impute missing values in the Occupational Requirements Survey (ORS) administered by the U.S. Bureau of Labor Statistics (BLS), using an iterative, machine-learning-based approach. The package contains functions to retrieve, process, and analyze the data, and produces useful visualizations of both the iterative process and the final imputed results. In this vignette, we walk through the full analysis stream for a subset of the data.

1 Overview of ORS Data

Occupational Requirements Survey. The Bureau of Labor Statistics (BLS) collects a wide range of data to capture important information regarding the workforce in the United States. One of the tools it uses to do this is the Occupational Requirements Survey (ORS), a comprehensive survey that gathers information regarding the various demands and conditions (requirements) of different occupations within the U.S. economy. These requirements fall into one of four categories: physical demands; environmental conditions; education, training, and experience; and cognitive and mental requirements. The survey is conducted in repeated waves across several years; the initial data analyzed were the publicly available datasets known as “Wave 1” and “Wave 2,” specifically Wave 1, year 2 and Wave 2, year 1. In this vignette, we focus exclusively on a subset of the Wave 1 data.

Structure of ORS Data. The data used in this vignette are publicly available from the BLS ORS, and are included within the imputeORS package. Importantly, this survey captures the distribution of the minimum job requirements of various occupations, rather than the actual qualifications of individuals working in those occupations (i.e., the survey does not capture information about under- and/or over-qualified employees). Additionally, this dataset covers only civilian occupations.

Numerous types of data are captured in the ORS, so we focus on the information we identified as relevant to our goal of imputing missing estimates for population distributions of occupational requirements. For any given observation, the relevant fields from the ORS were as follows:

Character fields: occupation_text, data_element_text, data_type_text, estimate_type_text, estimate_type, and unit_of_measure.

Numeric fields: upper_soc_code, reference_group, additive_group, and value.

2 Preprocessing

The preprocessing stage involves a number of steps, detailed below.

Raw Data and Transform. First, we obtain the raw data from the ORS website and transform them into the format used by the package. A sample of the transformed data is shown below.

#>       upper_soc_code                              occupation_text
#> 751         11101100                             Chief Executives
#> 3000        11303100                           Financial Managers
#> 10001       13115100         Training and Development Specialists
#> 50000       31909100                            Dental Assistants
#> 65001       39909900 Personal Care and Service Workers, All Other
#> 70000       43301100                  Bill and Account Collectors
#>                            data_element_text  data_type_text estimate_type_text
#> 751                 Post-employment training             YES           Estimate
#> 3000           Lifting/carrying Occasionally      NEGLIGIBLE     Standard error
#> 10001 Pre-employment training: Certification              NO           Estimate
#> 50000               Post-employment training             YES     Standard error
#> 65001                                Sitting 90th Percentile           Estimate
#> 70000                     Near visual acuity             YES     Standard error
#>       estimate_type unit_of_measure reference_group additive_group value
#> 751         Percent      Percentage              14             14  49.7
#> 3000           Mode      Percentage              25             NA   9.5
#> 10001       Percent      Percentage              89             NA  29.7
#> 50000       Percent      Percentage              14             14   1.6
#> 65001    Percentile  Percent of day              78             NA  70.0
#> 70000       Percent      Percentage              52             52  -6.0

Synthetic Additive Groups and Filtering. Next, we manually create a handful of additive groups using the available data, a step that was added on the advice of SSA analysts. These groups are:

The numbers assigned to these synthetic additive groups are derived from the reference_group field associated with the relevant observations. After the creation of these groups, we trim the data to only include observations associated with a valid additive group. This is all accomplished using the function syntheticAddGroups(), below.

x <- syntheticAddGroups(ors.from.csv)

Missing Observations and Corrections. Once the data are trimmed, we set about “completing” them. One gap in the data is that some observations within an occupational group are unlisted (i.e., not only is the estimate missing, but the observation itself is absent from the original data). To fix this, we generate all possible levels within a given occupational group, and then merge these observations with those available in the original dataset to produce a “complete” dataset. This is accomplished using the fillMissingObservations() and otherMissingObservations() functions. Note that because these additional observations do not have estimates associated with them, they become part of the imputation problem moving forward. Some additional corrections are also applied here (see the function dataCorrections() for more details).

x.complete <- fillMissingObservations(x)
x.complete <- otherMissingObservations(x.complete)
x.complete <- dataCorrections(x.complete)
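The completion idea can be sketched as follows. This is an illustrative toy example with made-up values (the levels are borrowed from the Reaching overhead requirement shown later); it is not the package implementation:

```r
# Sketch of data completion (toy example, not package code): build every
# possible level within a group, then merge so unlisted rows appear with NA.
observed <- data.frame(occupation_text = "Dental Assistants",
                       data_type_text  = c("SELDOM", "OCCASIONALLY"),
                       value           = c(0.02, 0.33))
all.levels <- expand.grid(occupation_text = "Dental Assistants",
                          data_type_text  = c("NOT PRESENT", "SELDOM",
                                              "OCCASIONALLY"),
                          stringsAsFactors = FALSE)
complete <- merge(all.levels, observed, all.x = TRUE)
nrow(complete)  # 3; the NOT PRESENT row now exists, with value NA
```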

Separate Means and Errors. In this step we separate the complete data into two data frames: one exclusively containing mean estimate observations, and another exclusively containing error observations. These are merged in a later step such that rather than having separate observations for the estimate and error for a given measure, they are assigned to the same observation (wide data rather than long data).

x.complete.means <- getMeanEstimates(x.complete)
x.complete.errors <- getErrors(x.complete)
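The later merge can be sketched like this (a toy example reusing the value and std.error column names seen in the data; not the package implementation):

```r
# Toy sketch of widening: after merging on the identifying fields, a single
# row carries both the mean estimate (value) and its error (std.error).
means  <- data.frame(occupation_text = "Accountants",
                     data_type_text  = "> 1 YEAR,  = 2 YEARS",
                     value = 0.017)
errors <- data.frame(occupation_text = "Accountants",
                     data_type_text  = "> 1 YEAR,  = 2 YEARS",
                     std.error = 0.005)
wide <- merge(means, errors, by = c("occupation_text", "data_type_text"))
ncol(wide)  # 4: the two keys plus value and std.error in one observation
```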

Direct Calculation. Some of the observations with missing estimates (including some of those generated by data completion) can be filled in from the available information. For example, some requirements are binary in nature, e.g., the requirement Prior work experience has only two levels: YES and NO. Given one estimate, the other can be calculated. More generally, for requirements where all but one estimate is known, the final one can easily be determined. Each occupational group is assessed to determine whether it is such an “N-1 group,” and those that are have their missing estimate calculated and added to the data.

ors.data <- fillNminusOneGroups(x.complete.means)
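The logic can be sketched as follows (toy values; not the package implementation):

```r
# Sketch of the N-1 rule: when exactly one estimate in an additive group is
# missing, it must equal the remaining percentage of the group.
group.values <- c(YES = 62.4, NO = NA)          # a binary requirement
if (sum(is.na(group.values)) == 1) {
  group.values[is.na(group.values)] <- 100 - sum(group.values, na.rm = TRUE)
}
group.values[["NO"]]  # 37.6
```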

Assign Errors, Bounds, Predictors, and Weights. In this step, we format the data for modeling. We first assign the errors to their relevant observations, and add bounds to each estimate based on the errors (where available). For observations missing an estimate/error, the bounds are calculated from the remaining available percentage within the occupational group. We also add predictor columns, derived from the requirement categories, requirements, and requirement levels associated with each observation. This mapping is stored in a variable called predictors.data, a sample of which is shown below. Finally, we add default modeling weights to each observation, based on the presence/absence of an estimate.

#>                       data_element_text data_type_text additive_group
#> 11              Pre-employment training            YES             12
#> 12                Prior work experience             NO             13
#> 13                Prior work experience            YES             13
#> 14                  On the Job Training             NO             14
#> 15                  On the Job Training            YES             14
#> 16 Sitting vs. standing/walking at will             NO             15
#> 17 Sitting vs. standing/walking at will            YES             15
#> 18                    Reaching overhead    NOT PRESENT             16
#> 19                    Reaching overhead   OCCASIONALLY             16
#> 20                    Reaching overhead         SELDOM             16
#>                             Requirement Frequency Intensity
#> 11              Pre-employment training       100       100
#> 12                Prior work experience         0       100
#> 13                Prior work experience       100       100
#> 14                  On the Job Training         0       100
#> 15                  On the Job Training       100       100
#> 16 Sitting vs. standing/walking at will         0       100
#> 17 Sitting vs. standing/walking at will       100       100
#> 18                             Reaching         0       100
#> 19                             Reaching        33       100
#> 20                             Reaching         2       100
#>    Requirement_category
#> 11                  SVP
#> 12                  SVP
#> 13                  SVP
#> 14                  SVP
#> 15                  SVP
#> 16                  PHY
#> 17                  PHY
#> 18                  PHY
#> 19                  PHY
#> 20                  PHY
ors.data <- errorsAndBounding(x.complete.errors,ors.data)
ors.data <- getPredictors(ors.data)
ors.data <- setDefaultModelingWeights(ors.data)
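The bounding rule is consistent with the data snapshot below: known estimates are bounded at one standard error around the value, while missing estimates fall between zero and the group's remaining percentage. A toy sketch (not the package implementation):

```r
# Toy sketch of bounding (not package code): known observations get
# value +/- one standard error; missing ones get [0, pct.remaining].
value <- 0.017; std.error <- 0.005; pct.remaining <- 0.040
known.bounds   <- c(lower = value - std.error, upper = value + std.error)
missing.bounds <- c(lower = 0, upper = pct.remaining)
known.bounds  # lower 0.012, upper 0.022
```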

Note that here we remove all observations associated with the occupation_text All Workers, and we advise doing the same, because it is a summary measure of all occupations considered in the ORS. For the purposes of this vignette only, we consider just four of the 22 possible 2-digit SOC codes. Table 2.1 lists the number of observations associated with each of the retained SOC2 codes; a guide to these codes can be found online at the BLS website. We also provide a snapshot of the data.

ors.data <- droplevels(ors.data[-which(ors.data$occupation_text=="All Workers"),])
ors.data <- droplevels(ors.data[which(ors.data$upSOC2 %in% c("11","13","15","19")),])
Table 2.1: Number of observations for each of the included SOC2 groups.

  SOC2 code   Count
  11           5916
  13           5508
  15           3672
  19           1632
#>                     occupation_text upper_soc_code data_element_text
#> 1                       Accountants       13201101               SVP
#> 2                       Accountants       13201101               SVP
#> 3                       Accountants       13201101               SVP
#> 4                       Accountants       13201101               SVP
#> 5                       Accountants       13201101               SVP
#> 6                       Accountants       13201101               SVP
#> 7                       Accountants       13201101               SVP
#> 8                       Accountants       13201101               SVP
#> 9                       Accountants       13201101               SVP
#> 10                        Actuaries       15201100               SVP
#> 11                        Actuaries       15201100               SVP
#> 12                        Actuaries       15201100               SVP
#> 13                        Actuaries       15201100               SVP
#> 14                        Actuaries       15201100               SVP
#> 15                        Actuaries       15201100               SVP
#> 16                        Actuaries       15201100               SVP
#> 17                        Actuaries       15201100               SVP
#> 18                        Actuaries       15201100               SVP
#> 19 Administrative Services Managers       11301100               SVP
#> 20 Administrative Services Managers       11301100               SVP
#>             data_type_text additive_group value std.error pct.remaining
#> 1   > 1 MONTH,  = 3 MONTHS             10    NA        NA         0.040
#> 2     > 1 YEAR,  = 2 YEARS             10 0.017     0.005            NA
#> 3    > 2 YEARS,  = 4 YEARS             10 0.313     0.031            NA
#> 4  > 3 MONTHS,  = 6 MONTHS             10    NA        NA         0.040
#> 5   > 4 YEARS,  = 10 YEARS             10 0.630     0.040            NA
#> 6    > 6 MONTHS,  = 1 YEAR             10    NA        NA         0.040
#> 7   >SHORT DEMO, = 1 MONTH             10    NA        NA         0.040
#> 8            OVER 10 YEARS             10    NA        NA         0.040
#> 9          SHORT DEMO ONLY             10    NA        NA         0.040
#> 10  > 1 MONTH,  = 3 MONTHS             10    NA        NA         1.000
#> 11    > 1 YEAR,  = 2 YEARS             10    NA        NA         1.000
#> 12   > 2 YEARS,  = 4 YEARS             10    NA        NA         1.000
#> 13 > 3 MONTHS,  = 6 MONTHS             10    NA        NA         1.000
#> 14  > 4 YEARS,  = 10 YEARS             10    NA        NA         1.000
#> 15   > 6 MONTHS,  = 1 YEAR             10    NA        NA         1.000
#> 16  >SHORT DEMO, = 1 MONTH             10    NA        NA         1.000
#> 17           OVER 10 YEARS             10    NA        NA         1.000
#> 18         SHORT DEMO ONLY             10    NA        NA         1.000
#> 19  > 1 MONTH,  = 3 MONTHS             10    NA        NA         0.189
#> 20    > 1 YEAR,  = 2 YEARS             10    NA        NA         0.189
#>    upper_bound lower_bound upSOC2 upSOC3 upSOC4 requirement frequency intensity
#> 1        0.040       0.000     13    132   1320         SVP       100         4
#> 2        0.022       0.012     13    132   1320         SVP       100        20
#> 3        0.344       0.282     13    132   1320         SVP       100        40
#> 4        0.040       0.000     13    132   1320         SVP       100         5
#> 5        0.670       0.590     13    132   1320         SVP       100        60
#> 6        0.040       0.000     13    132   1320         SVP       100         8
#> 7        0.040       0.000     13    132   1320         SVP       100         1
#> 8        0.040       0.000     13    132   1320         SVP       100       100
#> 9        0.040       0.000     13    132   1320         SVP       100         0
#> 10       1.000       0.000     15    152   1520         SVP       100         4
#> 11       1.000       0.000     15    152   1520         SVP       100        20
#> 12       1.000       0.000     15    152   1520         SVP       100        40
#> 13       1.000       0.000     15    152   1520         SVP       100         5
#> 14       1.000       0.000     15    152   1520         SVP       100        60
#> 15       1.000       0.000     15    152   1520         SVP       100         8
#> 16       1.000       0.000     15    152   1520         SVP       100         1
#> 17       1.000       0.000     15    152   1520         SVP       100       100
#> 18       1.000       0.000     15    152   1520         SVP       100         0
#> 19       0.189       0.000     11    113   1130         SVP       100         4
#> 20       0.189       0.000     11    113   1130         SVP       100        20
#>    req.cat known.val weight
#> 1      SVP         0    0.5
#> 2      SVP         1    1.0
#> 3      SVP         1    1.0
#> 4      SVP         0    0.5
#> 5      SVP         1    1.0
#> 6      SVP         0    0.5
#> 7      SVP         0    0.5
#> 8      SVP         0    0.5
#> 9      SVP         0    0.5
#> 10     SVP         0    0.5
#> 11     SVP         0    0.5
#> 12     SVP         0    0.5
#> 13     SVP         0    0.5
#> 14     SVP         0    0.5
#> 15     SVP         0    0.5
#> 16     SVP         0    0.5
#> 17     SVP         0    0.5
#> 18     SVP         0    0.5
#> 19     SVP         0    0.5
#> 20     SVP         0    0.5

3 Model Tuning

Next, we run through an example of model tuning. Our approach breaks this down into two stages: train-test, and k-folds cross validation. Both stages utilize an iterative modeling approach. In the train-test stage, model tuning is accomplished using only the known data; to mimic prediction, a subset of the data (the test set) is held out at the beginning of the procedure to create mock missing data. In the k-folds stage, both the known and the missing data are used. The test fold is selected from the known data, creating mock missing data in the same manner as in the train-test stage.

3.1 Train-test

The train-test paradigm was used to determine the model parameters to use moving forward, specifically the parameters nrounds, max_depth, and eta. In practice, this procedure was run multiple times until we optimized the parameters. Here, we simply run it once with the final parameterization that was obtained. The data are first separated into train/test/validate sets. Then, initial guesses are populated for the test set, followed by iterative model prediction.

model.results <- iterateModelTT(ors.data=ors.data,
                                n.iter=5,weight.step=0.05,
                                mdl.d=14,mdl.n=200,mdl.e=0.6)
#> [1] "Iteration 0..."
#> [1] "Generating test/train/validate sets..."
#> [1] "Generating guesses in test set..."
#> [1] "Iteration 1..."
#> [1] "Evaluating model..."
#> [1] "Iteration 2..."
#> [1] "Evaluating model..."
#> [1] "Iteration 3..."
#> [1] "Evaluating model..."
#> [1] "Iteration 4..."
#> [1] "Evaluating model..."
#> [1] "Iteration 5..."
#> [1] "Evaluating model..."
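The creation of mock missing data can be sketched as follows (a toy example with a hypothetical vector of estimates; not the package implementation):

```r
# Sketch of holding out known estimates to create mock missing data: the
# held-out values become the "test set" and are retained for later scoring.
set.seed(1)
estimates <- c(0.10, 0.30, NA, 0.60, NA, 0.25)   # NA = truly missing
known.idx <- which(!is.na(estimates))
test.idx  <- sample(known.idx, size = ceiling(0.25 * length(known.idx)))
mock      <- estimates
mock[test.idx] <- NA                  # mock missing values
actuals   <- estimates[test.idx]      # kept aside for evaluation
sum(is.na(mock))  # 3: two truly missing plus one held out
```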

3.2 K-folds Cross Validation

Next, we use k-folds cross validation (KFCV) to determine the blending ratio for our ensemble method. This approach first uses the smartGuess() function to generate an initial prediction for the missing values (both actual and mock), and then uses an iterative modeling procedure to refine these predictions. The full process involves running the KFCV paradigm twice: once using SOC2 codes to guide smart guessing, and once using SOC3 codes. The results of these models are then blended using a ratio calculated from the convergence iterations of the individual models.

K-folds Cross Validation Modeling. During the modeling stage, the predictions are initialized using smart guessing and then refined in an iterative fashion that leverages XGBoost. Once this is done using both the SOC2 and the SOC3 codes to guide smart guessing, the predictions for the test folds from each iteration are collated to generate a “full” predicted dataset for each iteration. A sample of the collation results from the SOC2 model is shown below, with Prediction0 corresponding to the output of the smart guess procedure, and all subsequent predictions corresponding to those generated by the iterative modeling process. Note that in these results, the first column (fold) indicates which fold a given observation belongs to, the second column (req.cat) indicates the requirement category of the observation, and the third column (actual) gives the actual value of the estimate (before it was held out).

# Use SOC3 codes to guide smart guessing, and model
soc3.mdl <- iterateModelKFCV(ors.data=ors.data,
                             n.iter=10,weight.step=0.05,
                             mdl.d=14,mdl.n=200,mdl.e=0.6,
                             fold.list=NULL,sg.soc.code="upSOC3")
#> [1] "Iteration 0..."
#> [1] "Generating folds..."
#> [1] "Iteration 1..."
#> [1] "Iteration 2..."
#> [1] "Iteration 3..."
#> [1] "Iteration 4..."
#> [1] "Iteration 5..."
#> [1] "Iteration 6..."
#> [1] "Iteration 7..."
#> [1] "Iteration 8..."
#> [1] "Iteration 9..."
#> [1] "Iteration 10..."
# Use SOC2 codes to guide smart guessing, and model
# (Use the folds generated during the SOC3 phase)
fl <- soc3.mdl$folds
soc2.mdl <- iterateModelKFCV(ors.data=ors.data,
                             n.iter=10,weight.step=0.05,
                             mdl.d=14,mdl.n=200,mdl.e=0.6,
                             fold.list=fl,sg.soc.code="upSOC2")
#> [1] "Iteration 0..."
#> [1] "Using provided folds..."
#> [1] "Iteration 1..."
#> [1] "Iteration 2..."
#> [1] "Iteration 3..."
#> [1] "Iteration 4..."
#> [1] "Iteration 5..."
#> [1] "Iteration 6..."
#> [1] "Iteration 7..."
#> [1] "Iteration 8..."
#> [1] "Iteration 9..."
#> [1] "Iteration 10..."
# Get test fold data
test.folds2 <- getTestFoldData(soc2.mdl)
test.folds3 <- getTestFoldData(soc3.mdl)
head(test.folds2)
#>                                                    fold req.cat actual
#> Accountants10>1YEAR,=2YEARS                           3     SVP  0.017
#> Accountants10>2YEARS,=4YEARS                          3     SVP  0.313
#> Accountants10>4YEARS,=10YEARS                         6     SVP  0.630
#> AdministrativeServicesManagers10>2YEARS,=4YEARS       9     SVP  0.178
#> AdministrativeServicesManagers10>4YEARS,=10YEARS      7     SVP  0.633
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS    4     SVP  0.568
#>                                                    Prediction0 Prediction1
#> Accountants10>1YEAR,=2YEARS                         0.02456839  0.02275250
#> Accountants10>2YEARS,=4YEARS                        0.05925896  0.06338542
#> Accountants10>4YEARS,=10YEARS                       0.18757081  0.18759358
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.04587500  0.06337862
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.10275000  0.31418519
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.70700000  0.71037892
#>                                                    Prediction2 Prediction3
#> Accountants10>1YEAR,=2YEARS                         0.01997569  0.01899637
#> Accountants10>2YEARS,=4YEARS                        0.06987788  0.07096555
#> Accountants10>4YEARS,=10YEARS                       0.19503461  0.20154545
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.07842219  0.09856568
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.39649393  0.46587905
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.70914092  0.70320724
#>                                                    Prediction4 Prediction5
#> Accountants10>1YEAR,=2YEARS                         0.01825424  0.01751429
#> Accountants10>2YEARS,=4YEARS                        0.07141774  0.07495140
#> Accountants10>4YEARS,=10YEARS                       0.19663334  0.19949098
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.10032029  0.11351600
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.49373051  0.52556225
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.68931010  0.68215677
#>                                                    Prediction6 Prediction7
#> Accountants10>1YEAR,=2YEARS                         0.01959155  0.01536126
#> Accountants10>2YEARS,=4YEARS                        0.08175797  0.09033387
#> Accountants10>4YEARS,=10YEARS                       0.20546383  0.20568970
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.10088023  0.10014714
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.51506374  0.51746200
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.67802113  0.67973847
#>                                                    Prediction8 Prediction9
#> Accountants10>1YEAR,=2YEARS                         0.01386481  0.01033710
#> Accountants10>2YEARS,=4YEARS                        0.09518533  0.09667799
#> Accountants10>4YEARS,=10YEARS                       0.20864541  0.21316246
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.09679408  0.09572176
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.53426800  0.53710864
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.67523263  0.67540517
#>                                                    Prediction10
#> Accountants10>1YEAR,=2YEARS                         0.009218727
#> Accountants10>2YEARS,=4YEARS                        0.096899032
#> Accountants10>4YEARS,=10YEARS                       0.220977512
#> AdministrativeServicesManagers10>2YEARS,=4YEARS     0.094293299
#> AdministrativeServicesManagers10>4YEARS,=10YEARS    0.537543921
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS  0.672052267

Model Convergence. After running the models, we determine the iteration at which they converge. The predictions from this convergence iteration (calculated for both the SOC2 and SOC3 models) are eventually used to determine the blending ratio that will be used for imputing missing values at a later step (see 4.2). Convergence is defined as the point at which the difference in root mean square error (RMSE) between consecutive predictions is &lt; 0.001. Note that because this vignette uses only a subset of the data, the models never meet this convergence criterion; in this situation, the last iteration is returned as the convergence iteration (iteration 10, for both models).

computeConvergence(test.folds2)
#> RMSE by iteration:
#>  Prediction0  Prediction1  Prediction2  Prediction3  Prediction4  Prediction5 
#>   0.16333450   0.14508097   0.13534668   0.12786482   0.12147780   0.11619503 
#>  Prediction6  Prediction7  Prediction8  Prediction9 Prediction10 
#>   0.11170793   0.10764041   0.10411858   0.10070571   0.09774716 
#> 
#> Differences in RMSE:
#> 0.01825352 0.009734287 0.007481862 0.00638702 0.005282773 0.0044871 0.004067521 0.003521825 0.003412872 0.002958549

Figure 3.1: RMSE by iteration of SOC2 model.

#> Convergence criteria unmet, returning final iteration
#> [1] 10
computeConvergence(test.folds3)
#> RMSE by iteration:
#>  Prediction0  Prediction1  Prediction2  Prediction3  Prediction4  Prediction5 
#>   0.17821356   0.14008939   0.13192531   0.12580777   0.12054885   0.11594333 
#>  Prediction6  Prediction7  Prediction8  Prediction9 Prediction10 
#>   0.11163077   0.10795266   0.10427675   0.10124913   0.09818892 
#> 
#> Differences in RMSE:
#> 0.03812417 0.008164084 0.006117544 0.00525891 0.004605525 0.004312562 0.003678107 0.003675912 0.003027623 0.00306021

Figure 3.2: RMSE by iteration of SOC3 model.

#> Convergence criteria unmet, returning final iteration
#> [1] 10
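The convergence rule can be sketched as follows (toy RMSE values; not the package implementation):

```r
# Sketch of the convergence rule: converge at the first iteration whose
# drop in RMSE is < 0.001; otherwise return the final iteration.
rmse.by.iter <- c(0.163, 0.145, 0.135, 0.128, 0.121)   # iterations 0-4
drops <- -diff(rmse.by.iter)
conv  <- if (any(drops < 0.001)) {
  which(drops < 0.001)[1]
} else {
  length(rmse.by.iter) - 1
}
conv  # 4: criterion unmet here, so the final iteration is returned
```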

Determine Blending Ratio. After determining the convergence iteration of each model, we then run an analysis to optimize the contribution of each model to the final prediction. The predictions from the two convergence iterations are blended in a ratio derived from minimizing the RMSE. From this analysis we arrived at a blending ratio of 51:49 (SOC2:SOC3). See below for the RMSE of predictions from each of the two analysis streams at the convergence iteration, as well as from the blended model. Figure 3.3 illustrates the blending ratio analysis. We report on this blended result in the next section. Importantly, this ensemble approach, and the ratio derived here, is also used throughout the remaining prediction steps, namely in 4.2.

blending.data <- computeBlendingRatio(soc2.mdl,soc3.mdl,print.plot=FALSE)
#> Computing convergence for model 1...
#> Convergence criteria unmet, returning final iteration
#> 
#> Computing convergence for model 2...
#> Convergence criteria unmet, returning final iteration
#> 
#> Convergence iteration for model 1: 10
#> Convergence iteration for model 2: 10

blending.data$soc2.proportion
#> [1] 0.51
blending.data$soc3.proportion
#> [1] 0.49

rmse(test.folds2$actual,test.folds2$Prediction10)
#> [1] 0.09774716
rmse(test.folds3$actual,test.folds3$Prediction10)
#> [1] 0.09818892
rmse(blending.data$test.folds.blended$actual,blending.data$test.folds.blended$Prediction10)
#> [1] 0.09414974

Figure 3.3: Calculation of model blending ratio. The predictions at the convergence iteration of each of the constituent models were combined in various ratios, and the RMSE of the resulting blended prediction was calculated. The ratio that minimized this RMSE was 51(SOC2):49(SOC3), represented by the red point on the plot.
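The ratio search can be sketched as a simple grid search (toy prediction vectors; not the package implementation):

```r
# Sketch of blending-ratio selection: blend the two converged prediction
# vectors in proportions w and 1 - w, keeping the w that minimizes RMSE.
rmse   <- function(a, p) sqrt(mean((a - p)^2))
actual <- c(0.02, 0.31, 0.63)
pred2  <- c(0.01, 0.10, 0.54)        # SOC2 model at convergence (toy)
pred3  <- c(0.03, 0.12, 0.50)        # SOC3 model at convergence (toy)
w      <- seq(0, 1, by = 0.01)
errs   <- sapply(w, function(p) rmse(actual, p * pred2 + (1 - p) * pred3))
w[which.min(errs)]  # 0.67 for these toy values
```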

Final Ensemble and Plot of Iterative Process. A visualization of the iterative process can be seen in Fig. 3.4. This visualization uses the blended, ensemble data, calculated with the ratio derived in the previous section. The actual value versus prediction is plotted to show the progression from the initial smart guess to the (adjusted) prediction for each iteration of the model. With each iteration, XGBoost moves the predictions toward their actual values: predictions are swept toward the diagonal at each step. Note that the blue points, representing the SVP requirement category, which contains the requirement with the highest number of possible observations (Minimum education level, with 9 levels), have the furthest to move and the highest error. Keep in mind that because this vignette only addresses a subset of the data, the convergence to the actual values is less pronounced than it would be with the full dataset.

# Plot progression of (blended) predictions
p23 <- plotTestFolds(blending.data$test.folds.blended,print.plot=FALSE)
ggpubr::ggarrange(plotlist=p23,nrow=ceiling(length(p23)/2),ncol=2,labels="auto",legend="none")

Figure 3.4: Predicted estimates vs. their actual values by iteration. We can see a convergence of the predictions toward their actual values. Grey corresponds to ENV, yellow to PHY, and cyan to SVP.

4 Imputation

Following model tuning, we address the original problem: imputation. This is carried out using the same iterative method utilized during tuning, operating on both the known and missing data. Here, however, there is no mock missing data. As in the KFCV approach, the (actual) missing data are initialized with a prediction using smart guessing, and XGBoost models are then used to iteratively refine these predictions. However, it is important to capture some of the variability inherent in the data within the framework of imputation. To do this, we first create a set of simulated data, derived from the original known data. This is then used moving forward to generate a distribution of predictions for each of the missing values, allowing for the calculation of confidence intervals rather than a single point estimate. The details of this procedure are demonstrated below.

4.1 Simulation (Data Uncertainty)

Simulated data are drawn from a \(\beta\) distribution using, for all known observations, the value field (\(\mu\)) as the mean and the std.error field (\(\sigma_e\)) as the standard deviation. The details of this calculation can be found in the full paper. As a result of this step, each known observation has, in addition to its original value, 10 simulated values associated with it, resulting in 10 simulated datasets.

The newly generated datasets were then run through the imputation process and each missing value produced a range of predictions, generating both a mean estimate and a distribution thereof. In this way, missing values that were more sensitive to the initial assumptions show wider variation in their range of predictions, leading to broader confidence intervals for the imputed missing values. The details of this analysis are described in the following section, 4.2.

ors.data.sims <- computeSimulations(ors.data)
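One standard way to obtain beta parameters with a given mean and standard deviation is the method of moments; the sketch below assumes that approach, though the package's exact calculation is detailed in the full paper. The toy values of mu and sigma are taken from the Accountants row shown earlier:

```r
# Method-of-moments sketch (assumed approach, see paper): choose alpha and
# beta so the beta distribution has mean mu and standard deviation sigma.
mu <- 0.313; sigma <- 0.031                  # value and std.error fields
nu    <- mu * (1 - mu) / sigma^2 - 1
alpha <- mu * nu
beta  <- (1 - mu) * nu
sims  <- rbeta(10, alpha, beta)              # 10 simulated values
alpha / (alpha + beta)  # 0.313, recovering the mean
```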

4.2 Impute Missing Values

Recall that the known values are simulated 10 times, per the procedure outlined in 4.1. The result is 10 distinct datasets (simulations) on which to perform imputation. For each simulation, the data are first subjected to the smart guessing procedure. Then the full dataset (known and smart-guessed estimates) is fed into an XGBoost fit, and the resulting model is used to predict the missing values. Note that, as in the KFCV step, two models are produced in this way: one using the SOC2 codes to guide smart guessing, and one using the SOC3 codes. These models are then blended using the convergence iterations and the ratios computed in 3.2. See the code and output below for details.

# Iterative imputing for SOC2 model
impute.soc2 <- iterateModel(ors.data.sims=ors.data.sims,n.iter=10,
                              weight.step=0.05,sg.soc.code="upSOC2")
#> [1] "Smart guessing for simulation 1 (Iteration 0)..."
#> [1] "Smart guessing for simulation 2 (Iteration 0)..."
#> [1] "Smart guessing for simulation 3 (Iteration 0)..."
#> [1] "Smart guessing for simulation 4 (Iteration 0)..."
#> [1] "Smart guessing for simulation 5 (Iteration 0)..."
#> [1] "Smart guessing for simulation 6 (Iteration 0)..."
#> [1] "Smart guessing for simulation 7 (Iteration 0)..."
#> [1] "Smart guessing for simulation 8 (Iteration 0)..."
#> [1] "Smart guessing for simulation 9 (Iteration 0)..."
#> [1] "Smart guessing for simulation 10 (Iteration 0)..."
#> [1] "-------ITERATING MODELS-------"
# Iterative imputing for SOC3 model
impute.soc3 <- iterateModel(ors.data.sims=ors.data.sims,n.iter=10,
                              weight.step=0.05,sg.soc.code="upSOC3")
#> [1] "Smart guessing for simulation 1 (Iteration 0)..."
#> [1] "Smart guessing for simulation 2 (Iteration 0)..."
#> [1] "Smart guessing for simulation 3 (Iteration 0)..."
#> [1] "Smart guessing for simulation 4 (Iteration 0)..."
#> [1] "Smart guessing for simulation 5 (Iteration 0)..."
#> [1] "Smart guessing for simulation 6 (Iteration 0)..."
#> [1] "Smart guessing for simulation 7 (Iteration 0)..."
#> [1] "Smart guessing for simulation 8 (Iteration 0)..."
#> [1] "Smart guessing for simulation 9 (Iteration 0)..."
#> [1] "Smart guessing for simulation 10 (Iteration 0)..."
#> [1] "-------ITERATING MODELS-------"
# Blend models using KFCV results
blended.model <- blendImputations(model.results.soc2=impute.soc2,
                                  model.results.soc3=impute.soc3,
                                  conv.iter.soc2=blending.data$soc2.conv.iter, 
                                  conv.iter.soc3=blending.data$soc3.conv.iter,
                                  soc2.prop=blending.data$soc2.proportion,
                                  soc3.prop=blending.data$soc3.proportion,
                                  ors.data.sims)
#> [1] "Calculating mean prediction at iteration 10"
#> [1] "Calculating mean prediction at iteration 10"

4.3 Confidence Intervals

After imputing over each of the simulations, we arrive at a distribution of 10 predictions per missing value. This allows us to calculate a mean prediction and its relevant 95% confidence interval for each of the missing values, a result which is more useful than a single point estimate. In Table 4.1 we provide a sample of these calculated confidence intervals (and associated mean predictions).
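As a rough sketch, the interval for one missing value might be computed from its 10 per-simulation predictions as follows, assuming a normal approximation on the standard error of the mean (the package's exact interval construction may differ; `ci_from_sims` is a hypothetical helper):

```r
# Sketch: mean prediction and 95% CI from the 10 per-simulation
# predictions of a single missing value.
ci_from_sims <- function(preds, level = 0.95) {
  m  <- mean(preds)
  se <- sd(preds) / sqrt(length(preds))  # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)       # ~1.96 for a 95% interval
  c(mean = m, lower = m - z * se, upper = m + z * se)
}

set.seed(7)
ci_from_sims(rnorm(10, mean = 0.05, sd = 0.01))
```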

model.CIs <- computeCIs(blended.model)

CIs <- model.CIs$CIs
CIs <- cbind(ors.data.sims[rownames(CIs),c("occupation_text","data_element_text","data_type_text")],CIs)
Table 4.1: Sample 95% confidence intervals, calculated from predictions of missing estimates across 10 simulations.
Occupation | Requirement | Requirement level | Mean prediction | CI (lower) | CI (upper)
---|---|---|---|---|---
Administrative Services Managers | SVP | > 1 MONTH, <= 3 MONTHS | 0.0078362 | 0.0061895 | 0.0094830
Public Relations and Fundraising Managers | Reaching overhead | SELDOM | 0.0385467 | 0.0260287 | 0.0510646
Administrative Services Managers | Lifting/carrying Occasionally | NONE | 0.3711352 | 0.3388069 | 0.4034634
Chief Executives | Pushing/pulling: Hands/arms | FREQUENTLY | 0.0026718 | 0.0017270 | 0.0036167
Storage and Distribution Managers | Gross manipulation | OCCASIONALLY | 0.6072535 | 0.5637946 | 0.6507124
Education Administrators, Postsecondary | Kneeling | OCCASIONALLY | 0.0076856 | 0.0033822 | 0.0119890
Compliance Managers | High, exposed places | SELDOM | 0.0047139 | 0.0038372 | 0.0055907
Sales Managers | Hearing requirements: Other Sounds | YES | 0.0615373 | 0.0570968 | 0.0659777
Education Administrators, Postsecondary | Minimum education level | HIGH SCHOOL VOCATIONAL | 0.0001393 | -0.0000558 | 0.0003344
Credit Analysts | Reaching overhead | NOT PRESENT | 0.9596998 | 0.9538858 | 0.9655137
Regulatory Affairs Specialists | Lifting/carrying Seldom | >1 LBS, <=10 POUNDS | 0.5414520 | 0.5206967 | 0.5622072
Insurance Underwriters | Crawling | CONSTANTLY | 0.0007694 | 0.0004728 | 0.0010659
Claims Examiners, Property and Casualty Insurance | Hazardous contaminants | SELDOM | 0.0021681 | 0.0016925 | 0.0026437
Compensation, Benefits, and Job Analysis Specialists | Stooping | CONSTANTLY | 0.0319217 | 0.0289224 | 0.0349210
Credit Analysts | Heavy vibrations | OCCASIONALLY | 0.0004232 | 0.0002293 | 0.0006171
Logistics Analysts | Hearing requirements: Group or conference | NO | 0.0897224 | 0.0790936 | 0.1003511
Market Research Analysts and Marketing Specialists | Strength | HEAVY WORK | 0.0093089 | 0.0064523 | 0.0121656
Computer Programmers | Reaching overhead | OCCASIONALLY | 0.0265749 | 0.0105536 | 0.0425963
Information Technology Project Managers | Lifting/carrying Frequently | >1 LBS, <=10 POUNDS | 0.0247576 | 0.0143199 | 0.0351953
Database Administrators | Gross manipulation | NOT PRESENT | 0.0263979 | 0.0057709 | 0.0470249
Information Technology Project Managers | Extreme cold | FREQUENTLY | 0.0017380 | 0.0013339 | 0.0021420
Actuaries | Hearing requirements: Group or conference | YES | 0.9629114 | 0.9555136 | 0.9703092
Environmental Scientists and Specialists, Including Health | SVP | > 6 MONTHS, <= 1 YEAR | 0.0203178 | 0.0161228 | 0.0245127
Life, Physical, and Social Science Occupations | Pushing/pulling: Hands/arms | OCCASIONALLY | 0.0102104 | 0.0049567 | 0.0154641
Life, Physical, and Social Science Occupations | Hazardous contaminants | SELDOM | 0.1625361 | 0.1463737 | 0.1786985
Chemical Technicians | Minimum education level | PROFESSIONAL | 0.0394071 | 0.0335857 | 0.0452285

4.4 Model Uncertainty

Uncertainty in the data itself is captured using the simulation technique described above. However, the models themselves are also a source of uncertainty. We note that during imputation, not only the missing values but also the known ones are predicted. At each iteration, the known observations are reset to their actual values before predicting in the subsequent iteration, but these predictions of the known values are used to roughly quantify the uncertainty contributed by the models. To do so, we calculate both the mean absolute error (MAE) and the mean error (ME) between the actual values of the known observations and the predictions generated by the model. This is done for each of the 10 simulations. At convergence (iteration 10 for all 10 simulations), the average (mean) MAE across all 10 simulations is 0.001934767, and the average ME is 6.421957e-05.
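These diagnostics reduce to simple summaries of the residuals on the known observations. A minimal sketch with illustrative values (the helpers and numbers below are hypothetical):

```r
# Sketch of the model-uncertainty diagnostics: mean absolute error (MAE)
# and mean signed error (ME) between the known values and the model's
# predictions of them at the convergence iteration.
mae <- function(actual, predicted) mean(abs(actual - predicted))
me  <- function(actual, predicted) mean(actual - predicted)

actual    <- c(0.10, 0.30, 0.60)  # known population estimates
predicted <- c(0.12, 0.28, 0.61)  # model predictions of those estimates

mae(actual, predicted)  # average magnitude of the residuals
me(actual, predicted)   # signed errors largely cancel, indicating low bias
```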

model.uncert <- computeModelUncertainty(impute.soc2,impute.soc3,10,10,0.51,0.49)
model.uncert$avg.mae
#> [1] 0.001934767
model.uncert$avg.me
#> [1] 6.421957e-05

5 Further Analysis

The values imputed for the missing estimates in 4 can be used to gather and extrapolate insights from the data captured in the ORS. We address a handful of such examples, and the relevant functions in the imputeORS package, below.

5.1 Expected Level of Effort

The “Expected Level of Effort” (ELE) is a measure we derived to compare occupations. It is a weighted average in which each requirement’s frequency and intensity are weighted by its population estimate. Low frequency, intensity, and population estimates result in a low level of effort, and conversely for high values.

For each occupational group, ELE is computed as an expected value \(E\) of frequency times intensity as follows, where \(\mu_j\) is the mean population prediction across all 10 simulations for the \(j^{th}\) observation, and \(F_j\) and \(I_j\) are the frequency and intensity of the \(j^{th}\) observation, respectively:

\[E=\sum_j{\mu_j*F_j*I_j}\]
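As a small illustration of this formula (the frequency and intensity codings below are hypothetical; the package derives them from the ORS requirement levels):

```r
# Sketch of the ELE calculation for one occupational group: the sum over
# observations of mean population prediction times frequency times
# intensity, E = sum(mu_j * F_j * I_j).
compute_ele <- function(mu, freq, intensity) sum(mu * freq * intensity)

mu        <- c(0.60, 0.30, 0.10)  # mean predictions across simulations
freq      <- c(1, 2, 3)           # e.g., SELDOM < OCCASIONALLY < FREQUENTLY
intensity <- c(1, 1, 2)           # illustrative intensity weights

compute_ele(mu, freq, intensity)  # 0.6*1*1 + 0.3*2*1 + 0.1*3*2 = 1.8
```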

expected.vals <- computeEVs(blended.model)

This measure can be used in a correlation calculation across occupations, as in Fig. 5.1, below. Further, it can be used to assess job similarity between different occupations (see 5.2.2 for more details). ELE can also be applied to improving survey design, but this is beyond the scope of this vignette (see full paper for further details).

htmp <- correlationPlot(blended.model,plot.dim=3,print.plot=FALSE)
htmp$plot.pub

Figure 5.1: Heatmap of occupational correlation of ELE, organized by 2-digit SOC code. Intra-group correlation (major diagonal) is generally higher than inter-group correlation.

5.2 Job Similarity

One of our proposed applications of the (imputed) ORS data is computing the similarity between different occupations. We describe two separate methods for doing so, detailed below in 5.2.1 and 5.2.2.

5.2.1 Overlap

Imputation generates a full distribution of population percentages for all the considered occupations. Thus, for two jobs, we can measure the overlap of these occupations by taking the product of their weightings, yielding a value on the interval \([0,1]\). Further analysis of the requirements and their overlap can help identify jobs that are similar or dissimilar.

If \(\omega_{1ir}\) is the population mean of the \(i^{th}\) level of the \(r^{th}\) requirement for Job 1 (average of simulation predictions for missing values, and actual value for known observations), and \(\omega_{2ir}\) is the same for Job 2, then we say that the overlap of \(r^{th}\) requirement (\(O_r\)) for these two jobs is:

\[O_r = \sum_{i} \omega_{1ir}*\omega_{2ir}\]

We demonstrate three overlap comparisons below, visualized in Fig. 5.2 - 5.4.
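As a small illustration of this inner product over the levels of a single requirement (the weightings below are hypothetical):

```r
# Sketch of the requirement-overlap calculation O_r: the inner product of
# the two jobs' population weightings across the levels of one requirement.
overlap <- function(w1, w2) sum(w1 * w2)

# Hypothetical weightings over four levels of a single requirement;
# each vector sums to 1 (the levels partition the population).
w.job1 <- c(0.70, 0.20, 0.08, 0.02)
w.job2 <- c(0.65, 0.25, 0.07, 0.03)

overlap(w.job1, w.job2)  # 0.455 + 0.05 + 0.0056 + 0.0006 = 0.5112
```

Two jobs with identical, concentrated weightings give an overlap near 1, while jobs whose populations sit at different levels give an overlap near 0.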

job.ovrlp1 <- computeOverlap(blended.model,"Accountants","Web Developers")
job.ovrlp2 <- computeOverlap(blended.model,"Accountants","Actuaries")
job.ovrlp3 <- computeOverlap(blended.model,"Accountants","Chemists")

Figure 5.2: Overlap of Accountants and Web Developers. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.


Figure 5.3: Overlap of Accountants and Actuaries. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.


Figure 5.4: Overlap of Accountants and Chemists. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.

5.2.2 Standardized ELE

Previously, in 5.1, we established a metric by which occupations could be compared across requirements in a correlation analysis. In Fig. 5.5, we standardize these expected values and plot their distributions by requirement. On the far right we have aggregated all the additive groups corresponding to the requirement Lifting/Carrying (24, 25, 26, and 27) into a single distribution labeled LC. The same was done for Reaching (additive groups 16 and 18), labeled R.
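A minimal sketch of this standardization, assuming a simple z-score within each additive group (the package's exact scaling may differ; `standardize_ele` is a hypothetical helper):

```r
# Sketch of standardizing ELEs within one requirement (additive group):
# center on the across-occupation mean and scale by the standard
# deviation, so values are comparable across requirements.
standardize_ele <- function(ele) (ele - mean(ele)) / sd(ele)

ele.by.occ <- c(1.8, 2.4, 2.1, 3.0)  # hypothetical ELEs for one requirement
standardize_ele(ele.by.occ)
```

After standardization, each requirement's distribution has mean 0 and unit standard deviation, which is why the standardized \(\mu_E\) in Fig. 5.5 sits at 0.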

# Standardize EVs
std.EVs <- standardizeEVs(blended.model)

# Differences in standardized EVs
stdEV.diff1 <- computeStdEVdiff(blended.model,"Accountants","Web Developers")
stdEV.diff2 <- computeStdEVdiff(blended.model,"Accountants","Actuaries")
stdEV.diff3 <- computeStdEVdiff(blended.model,"Accountants","Chemists")

Figure 5.5: Boxplot of ELEs, standardized by additive group (requirement). Red points indicate standardized \(\mu_E\)’s (i.e., 0), and blue points indicate outliers. Requirements where \(\mu_E\) and the median do not align are indicative of skew in the expected values.

We note that these standardized ELEs provide another avenue for comparing job similarity, specifically by taking the difference in standardized ELEs between two occupations. Below are three figures demonstrating the differences between standardized expected values by occupation. Specifically, we visualize the same three pairs that were addressed in the overlap analysis, namely Accountants vs. Web Developers (Fig. 5.6), Accountants vs. Actuaries (Fig. 5.7), and Accountants vs. Chemists (Fig. 5.8).


Figure 5.6: Difference in standardized ELEs between Accountants and Web Developers.


Figure 5.7: Difference in standardized ELEs between Accountants and Actuaries.


Figure 5.8: Difference in standardized ELEs between Accountants and Chemists.

6 Summary

In this vignette we demonstrate the use of the imputeORS package, a collection of methods by which to impute values for missing estimates within the Occupational Requirements Survey, administered by the U.S. Bureau of Labor Statistics. Our method leverages many of the inherent features of the data to arrive at a distribution of predictions. These features include the bounded nature of the data (all observations must be on the interval [0, 1]), negative correlation within an occupational group (percentages that must sum to 100%), quantified errors of known observations, and similarity between different occupations within a 2- or 3-digit SOC group.

We use the known errors to generate 10 simulations of the known estimates, producing a distribution of predictions that ultimately allows us to compute confidence intervals for our imputations rather than single point estimates. We then utilize an iterative approach using XGBoost to generate predictions for each simulation, iterating the models until they reach convergence. The predictions from the convergence iteration of each simulation are used to calculate a mean prediction and its 95% confidence interval.

Finally, we demonstrate a number of applications of the imputed values. Further details of the methods and applications described in this vignette can be found in the full paper. The full paper also describes a generalized, iterative imputation algorithm based on the procedures implemented in the imputeORS package.